library(tidyverse)
library(ggbeeswarm)
source("./scripts/utils.R")
theme_set(theme_linedraw())
df <- readr::read_csv("./data/final_project_train.csv", col_names = TRUE)
sel_num <- df %>% select_if(is_double) %>% select(!rowid) %>% select(sort(names(.))) %>% colnames()
sel_cat <- df %>% select(!one_of(sel_num)) %>% select(!rowid) %>% colnames()
df_eda <- df %>% mutate(
  log_response = log(response),
  binary_outcome = ifelse(outcome=="event", 1, 0))

This section continues the EDA process. In part 1A we explored the data distribution of the dataset. Here in 1B we will explore the covariation between inputs and output.


Continuous Response

Response vs Continuous Input

Examine the response with regard to every continuous input.

p <- df_eda %>% add_column(xs_98=NA, xs_99=NA) %>%
  pivot_longer(starts_with("x")) %>%
  ggplot(aes(x=value, y=response)) +
  geom_point(size=0.5, alpha=0.2) +
  geom_smooth(formula=y~x, method="lm") +
  geom_smooth(formula=y~x, method="loess", color="darkorange",fill="darkorange") +
  facet_wrap(~name, scales="free", ncol=8) +
  xlab("")
plot_grid(p)
## Loading required package: grid

  • There are correlations between many inputs and the response.
    • Indicate these inputs could be used to predict the response.
    • Since we’ll log-transform the response to remove lower bound for modelling, will examine again at log-transformed scale.

Examine the log-transformed response with regard to every continuous input.

p <- df_eda %>% add_column(xs_98=NA, xs_99=NA) %>%
  pivot_longer(starts_with("x")) %>%
  ggplot(aes(x=value, y=log_response)) +
  geom_point(size=0.5, alpha=0.2) +
  geom_smooth(formula=y~x, method="lm") +
  geom_smooth(formula=y~x, method="loess", color="darkorange",fill="darkorange") +
  facet_wrap(~name, scales="free", ncol=8) +
  xlab("")
plot_grid(p)

  • Between inputs and the log-transformed response:
    • The lexicon derived features xa, xb, xn at index 01, 02, 04, 06, 07 seem to be linear-correlated with log-response.
    • Inputs that worth adding non-linearity to: [feature-engineering]
      • The lexicon derived features xa, xb, xn at index 03, 05, 08
      • The NLP derived features xs index 01, 04, 06
      • The word count features xw index 01, 02, 03

Response vs Categorical Input

Check how each category level correlates with the log-transformed response.

df_eda %>%
  pivot_longer(c(region,customer), values_to="level") %>%
  ggplot(aes(x=level, y=log_response, color=level)) +
  geom_boxplot(outlier.size = 0.5) +
  geom_beeswarm(priority="none", size=0.5, alpha=0.5) +
  facet_grid(~name, scales="free_x", space="free_x") +
  theme_linedraw() + xlab("") + theme(legend.position="none")

  • It appears the categorical inputs have impact on the response.
    • The distribution across region have similar structure, the difference is the mean reponse time. Consider adding region additively. [feature-engineering]

For each region, check how response differ across customers.

df_eda %>%
  ggplot(aes(x=customer, y=log_response, color = customer)) +
  geom_boxplot() +
  geom_beeswarm(priority="none", size=0.5, alpha=0.5) +
  facet_wrap(~region) + xlab("")

  • At each region, response differs by customer.
    • On average the least response for customers Other in ZZ region.
    • On average the most response is on customer D in XX region.
    • Consider interacting the two categorical inputs but some customers are absent in a region, which could cause issue. [feature-engineering]

For each customer, check if response differ by regions.

df_eda %>%
  ggplot(aes(x=region, y=log_response, color = region)) +
  geom_boxplot() +
  geom_beeswarm(priority="none", size=0.5, alpha=0.5) +
  facet_wrap(~customer) + xlab("")

  • This basically reflects how the response correlates to region in the prevoius plot.

Response vs ALL Inputs

Plot log-transformed response for each continuous input, conditioned on the region.

p <- df_eda %>% add_column(xs_98=NA, xs_99=NA) %>%
  pivot_longer(starts_with("x")) %>%
  ggplot(aes(x=value, y=log_response, color=region)) +
    # geom_boxplot(aes(group=cut_number(value,12), y=response), size=0.2, alpha=0.1) +
    geom_point(size=0.5, alpha=0.2) +
    geom_smooth(formula=y~x, method="loess") +
    facet_wrap(~name, scales="free", ncol=8) +
    xlab("")
plot_grid(p)

  • The region variable affects the relationship between certain inputs and the response.
    • It totally changes the direction of the correlation between inputs xa_04, xa_05, xa_08, xs_01, xs_02 and response.
    • This suggests that at different region, the same inputs could have contrasting effect on response. [interpretation]
    • Consider interacting region to the inputs. [feature-engineering]

Plot log-transformed response from each continuous input, conditioned on customer.

p <- df_eda %>% add_column(xs_98=NA, xs_99=NA) %>%
  pivot_longer(starts_with("x")) %>%
  ggplot(aes(x=value, y=log_response,color=customer)) +
    # geom_boxplot(aes(group=cut_number(value,12), y=response), size=0.2, alpha=0.1) +
    geom_point(size=0.5, alpha=0.2) +
    geom_smooth(formula=y~x,method="lm", alpha=0.5, se=F) +
    facet_wrap(~name, scales="free", ncol=8) +
    xlab("")
plot_grid(p)

  • Just like region, the customer variable also affects how certain inputs could predict the response.
    • Consider interacting customer to the inputs. [feature-engineering]

Binary Outcome

Outcome vs Continuous Input

Check how the binary outcome is affected by the continuous inputs.

p = df_eda %>% add_column(xs_98=NA, xs_99=NA) %>%
  pivot_longer(starts_with("x")) %>%
  ggplot(aes(x=value, y=binary_outcome)) +
    geom_point(size=0.5, alpha=0.1) +
    geom_smooth(formula=y~x, method="glm", method.args=list(family=binomial)) +
    facet_wrap(~name, scales="free", ncol=8) +
    theme_linedraw() + xlab("")
plot_grid(p)

  • As shown by the smoother, some inputs have clear relation to the binary outcome.
    • For most inputs, low input values are associate with higher event probability.
    • This is not true for xs_04,xs_05,xs_06, and all xw input features, which does not seem to affect event probability too much. [interpretation]

Examine how the response affects outcome.

df_eda %>%
  pivot_longer(c("log_response", "response")) %>%
  ggplot(aes(x=value, y=binary_outcome)) +
    geom_point(size=0.5, alpha=0.1) +
    geom_smooth(formula=y~x, method="glm", method.args=list(family=binomial)) +
    facet_wrap(~name, scales="free", ncol=8) + xlab("")

  • The response does have some impact on the outcome.
    • Lower response associated with higher event probability.
    • Consider using response as predictor for outcome? [feature-engineering]

Outcome vs Categorical Input

Check how the event probability is related to the customer in each region.

df_eda %>% select(c(all_of(sel_cat),"outcome","binary_outcome")) %>%
  ggplot(aes(x=customer)) +
  geom_bar(aes(fill=outcome),position="fill") +
  geom_jitter(aes(y=binary_outcome), height=0, alpha=0.2) +
  facet_wrap(~region)

  • It appears that region and customer does have impact on event probability.
    • Customer G in region XX has highest event probability, but fairly low in region ZZ.
    • Customer E has almost zero event probability in region YY, but has some in region XX.
    • Consider interacting the two categorical inputs. [feature-engineering]

Outcome vs ALL Inputs

Examine outcome based on continuous input, conditioned on region.

p = df_eda %>% add_column(xs_98=NA, xs_99=NA) %>%
  pivot_longer(starts_with("x")) %>%
  ggplot(aes(x=value, y=binary_outcome, color=region)) +
    geom_point(size=0.5, alpha=0.1) +
    geom_smooth(formula=y~x, method="glm", method.args=list(family=binomial)) +
    facet_wrap(~name, scales="free", ncol=8) + xlab("")
plot_grid(p)

  • The region affects how some continuous inputs could predict the outcome.
    • Smoothed event probability trends for inputs xs_04, xs_05, xs_06 and xw inputs seem to have different direction depending on region.
    • Consider interacting region with inputs to predict outcome. [feature-engineering]

Examine outcome based on continuous input, conditioned on customer.

p = df_eda %>% add_column(xs_98=NA, xs_99=NA) %>%
  pivot_longer(starts_with("x")) %>%
  ggplot(aes(x=value, y=binary_outcome, color=customer)) +
    geom_point(size=0.5, alpha=0.1) +
    geom_smooth(formula=y~x, method="glm",
                method.args=list(family=binomial),
                se=F) +
    facet_wrap(~name, scales="free", ncol=8) + xlab("")
plot_grid(p)

  • Similarly, the customer variable affects how some continuous input predict event probability.
    • Consider interacting customer with inputs to predict outcome. [feature-engineering]